Introduction

The aim of this study is to predict tomorrow’s hourly electricity prices (market clearing prices, MCP, for each hour) using predictive AR and MA models and regression models. The data is acquired from EPİAŞ and covers 1 June 2015 to 21 June 2021. The last 30 days are used for testing the models.

Data Reading & Preprocessing

Importing necessary libraries

library("readxl")
library("ggplot2")
library("tidyverse")
library("zoo")
library(plotly)
library("data.table")
library(forecast)
library(ggcorrplot)
library(stats)
library(urca)
library(lubridate)

The target variable and the explanatory variables are gathered as a csv file from EPİAŞ:

df <- fread("mcp_with_variables.csv")

str(df)
## Classes 'data.table' and 'data.frame':   53112 obs. of  19 variables:
##  $ date       : IDate, format: "2015-06-01" "2015-06-01" ...
##  $ hour       : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ mcp_try    : num  103.8 88 49.1 33.8 28.9 ...
##  $ mcp_dollars: num  39 33.1 18.4 12.7 10.9 ...
##  $ mcp_euro   : num  35.61 30.19 16.83 11.59 9.93 ...
##  $ load_plan  : int  25600 23900 23000 22600 22500 21900 22300 25000 30200 33200 ...
##  $ total_prod : num  20657 20292 19666 19650 19664 ...
##  $ natural_gas: num  6198 6391 6403 6388 6300 ...
##  $ wind       : num  169 166 172 182 189 ...
##  $ lignite    : num  2799 2799 2799 2799 2799 ...
##  $ black_coal : num  742 630 630 630 630 630 630 742 742 714 ...
##  $ import_coal: num  3901 3708 3179 3112 3113 ...
##  $ fuel_oil   : num  417 417 417 417 417 417 417 417 462 462 ...
##  $ geothermal : num  74.8 74.8 74.8 74.8 74.8 74.8 74.8 74.8 74.8 74.8 ...
##  $ dam        : num  5057 4833 4726 4824 4798 ...
##  $ naphta     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ biomass    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ river      : num  1159 1134 1125 1082 1203 ...
##  $ other      : num  140 140 140 140 140 140 140 140 140 140 ...
##  - attr(*, ".internal.selfref")=<externalptr>
#note: drop_na() returns the cleaned table rather than modifying df in place
df <- drop_na(df)
#converting to data table
setDT(df)

#combining Date and Hour column
#df$NewDate <-ymd(df$date)+as.xts(df$hour)

df$year<-format(df$date, "%y")
df$day<-format(df$date,"%m/%d")
df$month<-format(df$date, "%m")

df[,NewDate:=ymd(date)+dhours(hour)]

There are no missing values.
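Since there is nothing to drop here, the check is easy to miss; a minimal base-R sketch on a hypothetical mini-table (values invented for illustration, not the EPİAŞ data) shows how to count NAs per column before dropping rows:

```r
# Hypothetical stand-in for the EPİAŞ table (illustrative values only)
dt <- data.frame(
  date    = as.Date("2015-06-01") + 0:2,
  hour    = 0:2,
  mcp_try = c(103.8, NA, 49.1)
)

# Missing values per column; rows containing any NA are then dropped
na_counts <- sapply(dt, function(col) sum(is.na(col)))
dt_clean  <- na.omit(dt)   # base-R analogue of tidyr::drop_na()

na_counts
nrow(dt_clean)
```

The same sapply line applied to df confirms that every column has zero NAs.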

1. Choose the currency for MCP

Visualizing a sample of currency data (TL, EUR, USD) and obtaining the correlation matrix:

ts.plot(df$mcp_try[50000:52000])

ts.plot(df$mcp_dollars[50000:52000])

ts.plot(df$mcp_euro[50000:52000])

cor(df[,3:5])
##               mcp_try mcp_dollars  mcp_euro
## mcp_try     1.0000000   0.5509976 0.5108139
## mcp_dollars 0.5509976   1.0000000 0.9920998
## mcp_euro    0.5108139   0.9920998 1.0000000

The time series samples show that MCP has similar seasonality and trend in all three currencies. However, the TRY price is only moderately correlated with EUR and USD (approx. 0.51–0.55), while EUR and USD are 99% correlated with each other. This may indicate that market prices are regulated (by EPİAŞ, the government, etc.) rather than moving with sudden TRY/USD exchange-rate changes. Therefore, the TRY MCP will be used in the rest of this study.

2. Plotting the time series of MCP

p <- ggplot(data = df, aes(x = NewDate, y = mcp_try ,group=1))+
  geom_line(color = '#007cc3') + labs(title = 'Hourly Electricity Price in Turkey between 2015-2021 (TRY)', x= "Date", y = "Hourly Electricity Price(TRY)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
#fig <- ggplotly(p)

#fig <- fig %>% layout(dragmode = "pan")

#fig
p

It can clearly be seen that the mean and variance increase between 2015 and 2020, so the data is not stationary and has a positive trend. To confirm whether the data is stationary or not, the KPSS unit root test is used:

KPSS_test= ur.kpss(df$mcp_try)
summary(KPSS_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 19 lags. 
## 
## Value of test-statistic is: 174.2189 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
plot(KPSS_test)

The null hypothesis of the KPSS test is that the data is stationary, and the test statistic far exceeds the critical values. Therefore, the null hypothesis is rejected and it can be concluded that the data is non-stationary.

3. Plotting the autocorrelation function of the MCP

acf(df$mcp_try)

The autocorrelations decay and rise again with a 24-lag period. This slow decay indicates that a trend exists in the data, and the peaks at lags 0, 24, and 48 indicate daily seasonality (a 24-hour period).
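As a sanity check on this reading of the ACF, a purely synthetic hourly series with a 24-hour cycle (a sketch, not the EPİAŞ data) reproduces the peaks at lags 24 and 48:

```r
set.seed(1)
h <- 0:(24 * 60 - 1)                       # 60 synthetic "days" of hourly points
x <- sin(2 * pi * h / 24) + rnorm(length(h), sd = 0.3)

a <- acf(x, lag.max = 48, plot = FALSE)
# Strongly positive autocorrelation at the seasonal lags 24 and 48
# (a$acf index 1 corresponds to lag 0)
round(a$acf[c(25, 49)], 2)
```

In the MCP data these peaks sit on top of a slowly decaying baseline, which is the signature of the trend.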

METHOD A: FORECASTING WITH TIME SERIES ANALYSIS

1. Time series decomposition

A monthly sample of the time series, from 1 June 2015 to 30 June 2015:

ts.plot(df$mcp_try[1:720])

From this preliminary view, the data appears to have both weekly and daily seasonality (the daily seasonality was already found above). For further understanding, decompositions at different levels (hourly, daily, weekly, monthly) are required.

Hourly Decomposition

df_new <- ts(df$mcp_try,frequency=24)
ts.plot(df$mcp_try[0:72])

df_additive <- decompose(df_new,type="additive")
plot(df_additive)

From the decomposition figure, the seasonality is not clearly observable, but the 3-day plot above shows that prices are lower between 03:00 and 09:00 every day, meaning that hourly seasonality exists.

Daily Decomposition

To analyze daily trend and seasonality, the hourly data is aggregated into daily data by averaging each 24-hour period.

daily_df <- df %>% group_by(year,day) %>% summarize(DailyAvg = mean(`mcp_try`))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
df_new <- ts(daily_df$DailyAvg,frequency=7)

df_additive <- decompose(df_new,type="additive")
plot(df_additive)

The trend and seasonality are clearly observable in the figure above. The random component resembles white noise, but large deviations can be observed on special days, i.e. national holidays.

Monthly Decomposition

To analyze monthly trend and seasonality, the daily data is aggregated into monthly data by averaging the daily values.

monthly_df <- df %>% group_by(year,month,day) %>% summarize(DailyAvg = mean(`mcp_try`))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
monthly_df <- monthly_df %>% group_by(year,month) %>% summarize(Monthly = mean(DailyAvg))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
df_new <- ts(monthly_df$Monthly,frequency=12)

df_additive <- decompose(df_new,type="additive")
plot(df_additive)

For the monthly data, the seasonality and the trend line are very clearly observable.

Since seasonality exists at both the daily (24-hour) and weekly levels, there is a pattern every 168 hours (7 days). The AR and MA models will be built upon this pattern.

2. Neighborhood search of the initial model (i.e. ARIMA/SARIMA).

Given the pattern every 168 hours (7 days), the time series is decomposed with a weekly frequency:

df_new <- ts(df$mcp_try,frequency=168)

df_additive <- decompose(df_new,type="additive")
plot(df_additive)

ts.plot(df$mcp_try[1:144])

For the 7-day period from Monday to Sunday (1–7 June 2015), the electricity prices show a similar daily pattern on weekdays and weekends alike. The decomposition shows that trend and seasonality exist in the data and should be removed before fitting the predictive models. The remainder also resembles white noise, with large deviations occurring on special days.

3. Deseasonalizing and Detrending the Time Series and Applying AR Models

To deseasonalize and detrend the time series, the trend and seasonal components are subtracted from the price data:

df_deseasoned_detrended <- df_new-df_additive$seasonal-df_additive$trend

p <- ggplot(data = df_deseasoned_detrended, aes(x = df$NewDate, y = df_deseasoned_detrended,group=1))+
  geom_line(color = '#007cc3') + labs(title = 'Deaseasoned and Detrended Data', x= "Date", y = "Hourly Electricity Price(TRY)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
p
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Warning: Removed 168 row(s) containing missing values (geom_path).

The data now looks stationary, with no increasing or decreasing trend. To determine appropriate AR, MA, and ARMA models, the KPSS unit root test can be applied to check stationarity for the differencing parameter (d), and the ACF and PACF plots should be examined for the p and q parameters:

KPSS_test= ur.kpss(df_deseasoned_detrended)
summary(KPSS_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 19 lags. 
## 
## Value of test-statistic is: 7e-04 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

This time the null hypothesis is not rejected, since the test statistic is below all critical values. Therefore, I can conclude that the data is now stationary.
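A complementary check, sketched here on synthetic data, uses base R’s PP.test (Phillips-Perron), whose null hypothesis is the opposite of KPSS: that the series has a unit root, i.e. is non-stationary. The two tests agreeing (KPSS not rejecting, PP rejecting) strengthens the stationarity conclusion:

```r
# stats::PP.test: the null hypothesis is that the series HAS a unit root
set.seed(42)
rw <- cumsum(rnorm(500))              # synthetic random walk: non-stationary

p_level <- PP.test(rw)$p.value        # typically large: unit root not rejected
p_diff  <- PP.test(diff(rw))$p.value  # small: stationary after one difference
c(level = p_level, differenced = p_diff)
```

The same calls apply directly to df_deseasoned_detrended (after removing its NA edges).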

acf(df_deseasoned_detrended,na.action=na.pass)

pacf(df_deseasoned_detrended,na.action=na.pass)

From the ACF figure, the q parameter should be between 1 and 7; lag 24 also stands out, but is excluded because of computational complexity. From the PACF figure, the p parameter should be between 1 and 3. For training, the test set (the last 30 days) must be excluded from the dataset. Starting a neighborhood search with AR models:

Train-test set split:

df_train <- df_deseasoned_detrended[1:(length(df_deseasoned_detrended)-30*24)]
df_test <- df_deseasoned_detrended[(length(df_deseasoned_detrended)-30*24+1):length(df_deseasoned_detrended)]

AR Models:

model <- arima(df_train, order=c(1,0,0))
print(model)
## 
## Call:
## arima(x = df_train, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.6911     0.0042
## s.e.  0.0032     0.4309
## 
## sigma^2 estimated as 926.6:  log likelihood = -252894.2,  aic = 505794.5
AIC(model)
## [1] 505794.5
model <- arima(df_train, order=c(2,0,0))
print(model)
## 
## Call:
## arima(x = df_train, order = c(2, 0, 0))
## 
## Coefficients:
##          ar1     ar2  intercept
##       0.6743  0.0244     0.0042
## s.e.  0.0044  0.0044     0.4417
## 
## sigma^2 estimated as 926.1:  log likelihood = -252878.6,  aic = 505765.2
AIC(model)
## [1] 505765.2
model <- arima(df_train, order=c(3,0,0))
print(model)
## 
## Call:
## arima(x = df_train, order = c(3, 0, 0))
## 
## Coefficients:
##          ar1      ar2     ar3  intercept
##       0.6732  -0.0044  0.0428     0.0042
## s.e.  0.0044   0.0053  0.0044     0.4609
## 
## sigma^2 estimated as 924.4:  log likelihood = -252830.7,  aic = 505671.5
AIC(model)
## [1] 505671.5

The lowest AIC is achieved with p=3, which will be used in the ARMA model.

4. Applying MA Models

As mentioned before, from the ACF figure, the q parameter should be between 1 and 7 for the MA models. To limit computational cost, q values between 1 and 5 are tried:

model <- arima(df_train, order=c(0,0,1))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 1))
## 
## Coefficients:
##          ma1  intercept
##       0.5679     0.0042
## s.e.  0.0030     0.2313
## 
## sigma^2 estimated as 1139:  log likelihood = -258286.6,  aic = 516579.3
AIC(model)
## [1] 516579.3
model <- arima(df_train, order=c(0,0,2))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 2))
## 
## Coefficients:
##          ma1     ma2  intercept
##       0.6586  0.2932     0.0042
## s.e.  0.0042  0.0036     0.2725
## 
## sigma^2 estimated as 1019:  log likelihood = -255390.8,  aic = 510789.6
AIC(model)
## [1] 510789.6
model <- arima(df_train, order=c(0,0,3))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 3))
## 
## Coefficients:
##          ma1     ma2     ma3  intercept
##       0.6690  0.3975  0.2193     0.0042
## s.e.  0.0043  0.0043  0.0040     0.3104
## 
## sigma^2 estimated as 964.9:  log likelihood = -253953.8,  aic = 507917.7
AIC(model)
## [1] 507917.7
model <- arima(df_train, order=c(0,0,4))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 4))
## 
## Coefficients:
##          ma1     ma2     ma3     ma4  intercept
##       0.6728  0.4217  0.2931  0.1456     0.0042
## s.e.  0.0044  0.0050  0.0045  0.0042     0.3403
## 
## sigma^2 estimated as 943.8:  log likelihood = -253373.5,  aic = 506759
AIC(model)
## [1] 506759
model <- arima(df_train, order=c(0,0,5))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 5))
## 
## Coefficients:
##          ma1     ma2     ma3     ma4     ma5  intercept
##       0.6700  0.4289  0.3229  0.2050  0.1039     0.0042
## s.e.  0.0044  0.0051  0.0048  0.0048  0.0043     0.3648
## 
## sigma^2 estimated as 933.4:  log likelihood = -253084.5,  aic = 506183
AIC(model)
## [1] 506183

q=5 gives the lowest AIC. Increasing q further might improve it, but due to computational complexity, q=5 will be used from here on.
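The manual neighborhood search above can be automated with a small grid scan. A sketch on a short synthetic AR(2) series so that it runs quickly; the same loop applies to df_train with the p and q ranges used above:

```r
set.seed(7)
x <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 1000)  # synthetic AR(2)

# Scan a small (p, q) grid and keep the order with the lowest AIC
best <- list(aic = Inf, order = NULL)
for (p in 0:2) {
  for (q in 0:2) {
    fit <- try(arima(x, order = c(p, 0, q)), silent = TRUE)
    if (!inherits(fit, "try-error") && AIC(fit) < best$aic) {
      best <- list(aic = AIC(fit), order = c(p, 0, q))
    }
  }
}
best$order   # lowest-AIC (p, d, q) found in the grid
```

The forecast package’s auto.arima() performs a similar stepwise search out of the box.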

5. Comparison of AR and MA Models and Training and Prediction of ARMA Models

Starting with best models(models with lowest AIC):

model <- arima(df_train, order=c(3,0,5))
print(model)
## 
## Call:
## arima(x = df_train, order = c(3, 0, 5))
## 
## Coefficients:
##          ar1     ar2      ar3     ma1      ma2     ma3     ma4     ma5
##       0.4562  0.6593  -0.4184  0.2159  -0.5193  0.1338  0.0999  0.0250
## s.e.  0.0569  0.0508   0.0397  0.0569   0.0587  0.0094  0.0100  0.0085
##       intercept
##          0.0046
## s.e.     0.4185
## 
## sigma^2 estimated as 921.6:  log likelihood = -252752.4,  aic = 505524.8
AIC(model)
## [1] 505524.8
BIC(model)
## [1] 505613.4

The AIC is lower (better) than those of the models above. Checking the residuals:

checkresiduals(model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,5) with non-zero mean
## Q* = 95.872, df = 3, p-value < 2.2e-16
## 
## Model df: 9.   Total lags used: 12

Some autocorrelation remains in the residuals, most prominently at lag 24 (yesterday’s same hour); however, the residuals are approximately normally distributed. A visual comparison of the prediction and the test set would be a good idea. To add the seasonality back, the first 720 (30 × 24) seasonal values are used, since the seasonal component repeats every 168 points; for the trend component, the last 720 non-missing trend values are used, since the trend estimate is NA at the end of the series.

model_forecast <- predict(model, n.ahead = 30*24)$pred
last_trend_value <-tail(df_additive$trend[!is.na(df_additive$trend)],30*24)

seasonality <- df_additive$seasonal[1:(30*24)]
forecast_combined <-model_forecast+last_trend_value+seasonality
df$forecast <- 0
df$forecast[(length(df_deseasoned_detrended)-30*24+1):length(df_deseasoned_detrended)] <-forecast_combined


p <- ggplot()+
  geom_line(data = df[52000:53112],aes(x = NewDate,y=mcp_try,color='Actual Data') ) +
  geom_line(data = df[(53112-30*24+1):53112],aes(x = NewDate,y=forecast,color='Prediction')) +
                     scale_color_manual(values = c(
    'Actual Data' = '#007cc3',
    'Prediction' = 'red')) + labs(title = 'Hourly Electricity Price(TL/MWh) Prediction', x= "Date", y = "Hourly Electricity Price(TL/MWh)",color = 'Legend') +
  theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
p

The model captures the daily and hourly seasonality, but cannot predict the deviation from the mean after about 10 days; i.e. the model is more accurate for the first 10 days, after which the hourly deviations are no longer captured. To evaluate the model and measure its overall accuracy, the daily mean absolute percentage error (MAPE) for each date and the weighted mean absolute percentage error (WMAPE) are calculated:

metrics <- df[(53112-30*24+1):53112] %>% group_by(year,day) %>% summarize(MAPE = mean(abs((mcp_try-forecast)/mcp_try)) * 100)
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
metrics

Contrary to the visual impression, MAPE is fairly stable over the 30 days, with high MAPEs observed on roughly 7 of the 30 days. To further improve the model, these large deviations could be analyzed; however, within the scope of this study, forecasting with regression is done next. WMAPE of the model:

wmape <- abs(sum((df[(53112-30*24+1):53112]$mcp_try-df[(53112-24*30+1):53112]$forecast)*metrics$MAPE*100))/sum(df[(53112-24*30+1):53112]$mcp_try)
wmape
## [1] 0.5896703

WMAPE is approximately 5.9% (note that the formula above additionally weights by the daily MAPE, so the printed value is not directly on a percentage scale).
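For reference, the conventional definitions: MAPE averages the pointwise percentage errors, while WMAPE divides the total absolute error by the total actual value. A small self-contained sketch (the in-text computation additionally weights by the daily MAPE, so its scale differs from this standard form):

```r
mape  <- function(actual, forecast) mean(abs((actual - forecast) / actual)) * 100
wmape <- function(actual, forecast) sum(abs(actual - forecast)) / sum(actual) * 100

actual   <- c(100, 200, 400)
forecast <- c(110, 190, 380)
mape(actual, forecast)    # mean of 10%, 5%, 5%
wmape(actual, forecast)   # (10 + 10 + 20) / 700 * 100
```

WMAPE down-weights the hours where a small actual price makes the percentage error explode, which is why it is often preferred for electricity prices.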

METHOD B: FORECASTING WITH REGRESSION

Regression analysis may give a better model, since we have more than 10 variables related to the MCP. After splitting into train and test sets, the data is fitted with a trend term, year and month variables, and the variables in the data (excluding EUR and USD prices):

#trend:
df_train <- df[1:(53112-30*24),]
df_test <- df[(53112-30*24+1):53112,]
df_train[,trend:=1:.N]

fit <- lm(mcp_try ~hour+month+load_plan+total_prod+natural_gas+wind+lignite+black_coal+import_coal+fuel_oil+geothermal+dam+naphta+biomass+river+other+year ,data=df_train)
summary(fit)
## 
## Call:
## lm(formula = mcp_try ~ hour + month + load_plan + total_prod + 
##     natural_gas + wind + lignite + black_coal + import_coal + 
##     fuel_oil + geothermal + dam + naphta + biomass + river + 
##     other + year, data = df_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -240.62  -26.44   -0.57   26.72 1680.02 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.003e+02  3.191e+00 -31.435  < 2e-16 ***
## hour        -4.689e-01  3.812e-02 -12.301  < 2e-16 ***
## month02     -6.974e+00  1.109e+00  -6.289 3.23e-10 ***
## month03      1.038e+01  1.276e+00   8.136 4.17e-16 ***
## month04      2.281e+01  1.571e+00  14.519  < 2e-16 ***
## month05      2.809e+01  1.633e+00  17.200  < 2e-16 ***
## month06      7.039e+00  1.375e+00   5.118 3.09e-07 ***
## month07     -1.368e+01  1.273e+00 -10.746  < 2e-16 ***
## month08      4.300e+00  1.263e+00   3.404 0.000664 ***
## month09      2.695e+01  1.280e+00  21.046  < 2e-16 ***
## month10      3.945e+01  1.302e+00  30.295  < 2e-16 ***
## month11      2.815e+01  1.427e+00  19.731  < 2e-16 ***
## month12      2.366e+01  1.430e+00  16.539  < 2e-16 ***
## load_plan    1.136e-03  1.169e-04   9.717  < 2e-16 ***
## total_prod   4.342e+01  8.075e+01   0.538 0.590793    
## natural_gas -4.342e+01  8.075e+01  -0.538 0.590822    
## wind        -4.342e+01  8.075e+01  -0.538 0.590766    
## lignite     -4.341e+01  8.075e+01  -0.538 0.590868    
## black_coal  -4.340e+01  8.075e+01  -0.537 0.590995    
## import_coal -4.340e+01  8.075e+01  -0.537 0.590983    
## fuel_oil    -4.342e+01  8.075e+01  -0.538 0.590808    
## geothermal  -4.348e+01  8.075e+01  -0.538 0.590287    
## dam         -4.341e+01  8.075e+01  -0.538 0.590864    
## naphta      -4.666e+01  8.075e+01  -0.578 0.563366    
## biomass     -4.325e+01  8.075e+01  -0.536 0.592233    
## river       -4.342e+01  8.075e+01  -0.538 0.590766    
## other       -4.345e+01  8.075e+01  -0.538 0.590540    
## year16      -4.526e-01  1.503e+00  -0.301 0.763241    
## year17       2.289e+01  2.096e+00  10.919  < 2e-16 ***
## year18       5.608e+01  3.300e+00  16.995  < 2e-16 ***
## year19       6.548e+01  4.266e+00  15.348  < 2e-16 ***
## year20       8.221e+01  5.393e+00  15.242  < 2e-16 ***
## year21       1.245e+02  7.408e+00  16.809  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.09 on 52359 degrees of freedom
## Multiple R-squared:  0.712,  Adjusted R-squared:  0.7119 
## F-statistic:  4046 on 32 and 52359 DF,  p-value: < 2.2e-16
#checkresiduals(fit)

The adjusted R-squared (0.71) is a good starting point; parameters with high p-values will be discarded in the next model. Residuals are not checked for now, since the computation freezes my computer:
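Discarding high-p-value terms by hand can also be automated with stats::step(), which greedily removes terms by AIC. A sketch on synthetic data (the variable names here are invented; the same call applies to the full fit above):

```r
set.seed(3)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), junk = rnorm(n))
d$y <- 2 * d$x1 - d$x2 + rnorm(n)          # junk has no effect on y

full    <- lm(y ~ x1 + x2 + junk, data = d)
reduced <- step(full, trace = 0)           # backward elimination by AIC
names(coef(reduced))                       # junk is typically dropped
```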

fit <- lm(mcp_try ~hour+load_plan+year+month ,data=df_train)
summary(fit)
## 
## Call:
## lm(formula = mcp_try ~ hour + load_plan + year + month, data = df_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -259.01  -27.57    0.54   30.25 1674.91 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.930e+01  1.942e+00 -45.978  < 2e-16 ***
## hour        -4.159e-02  3.931e-02  -1.058    0.290    
## load_plan    6.560e-03  5.704e-05 115.001  < 2e-16 ***
## year16       9.873e+00  9.570e-01  10.316  < 2e-16 ***
## year17       2.066e+01  9.705e-01  21.282  < 2e-16 ***
## year18       8.578e+01  9.749e-01  87.992  < 2e-16 ***
## year19       1.154e+02  9.734e-01 118.518  < 2e-16 ***
## year20       1.351e+02  9.710e-01 139.149  < 2e-16 ***
## year21       1.752e+02  1.328e+00 131.957  < 2e-16 ***
## month02     -5.899e+00  1.148e+00  -5.139 2.77e-07 ***
## month03     -9.233e+00  1.126e+00  -8.201 2.44e-16 ***
## month04     -1.079e+01  1.151e+00  -9.375  < 2e-16 ***
## month05     -7.003e-01  1.159e+00  -0.604    0.546    
## month06      1.292e+01  1.159e+00  11.144  < 2e-16 ***
## month07      6.416e+00  1.147e+00   5.591 2.27e-08 ***
## month08      2.706e+01  1.149e+00  23.548  < 2e-16 ***
## month09      4.815e+01  1.151e+00  41.841  < 2e-16 ***
## month10      5.688e+01  1.152e+00  49.393  < 2e-16 ***
## month11      4.399e+01  1.154e+00  38.137  < 2e-16 ***
## month12      3.613e+01  1.142e+00  31.625  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.98 on 52372 degrees of freedom
## Multiple R-squared:  0.6505, Adjusted R-squared:  0.6503 
## F-statistic:  5129 on 19 and 52372 DF,  p-value: < 2.2e-16
#checkresiduals(fit)

Now all variables except hour are significant. We intuitively know that hour is an important variable, so I decided to leave it in the model for now. Introducing a ratio-type variable, renewable energy generation / total production (note: the ratio as computed below also includes lignite):

df$ratio <- (df$wind+df$lignite + df$geothermal + df$biomass)/df$total_prod
df[,trend:=1:.N]

df_train <- df[1:(53112-30*24),]
df_test <- df[(53112-30*24+1):53112,]

fit <- lm(mcp_try ~hour+load_plan+year+month+ratio ,data=df_train)
summary(fit)
## 
## Call:
## lm(formula = mcp_try ~ hour + load_plan + year + month + ratio, 
##     data = df_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -243.71  -27.26   -0.15   29.59 1690.03 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.661e+00  2.436e+00  -3.966 7.31e-05 ***
## hour        -1.611e-01  3.840e-02  -4.195 2.73e-05 ***
## load_plan    5.073e-03  6.254e-05  81.116  < 2e-16 ***
## year16       2.879e+01  1.002e+00  28.746  < 2e-16 ***
## year17       4.696e+01  1.073e+00  43.757  < 2e-16 ***
## year18       1.267e+02  1.235e+00 102.653  < 2e-16 ***
## year19       1.604e+02  1.285e+00 124.823  < 2e-16 ***
## year20       1.780e+02  1.256e+00 141.739  < 2e-16 ***
## year21       2.324e+02  1.699e+00 136.788  < 2e-16 ***
## month02     -4.532e+00  1.120e+00  -4.048 5.18e-05 ***
## month03     -9.644e+00  1.098e+00  -8.784  < 2e-16 ***
## month04     -1.774e+01  1.130e+00 -15.699  < 2e-16 ***
## month05     -8.323e+00  1.139e+00  -7.306 2.80e-13 ***
## month06      1.139e+01  1.131e+00  10.073  < 2e-16 ***
## month07      1.060e+01  1.122e+00   9.445  < 2e-16 ***
## month08      3.399e+01  1.128e+00  30.124  < 2e-16 ***
## month09      5.047e+01  1.123e+00  44.941  < 2e-16 ***
## month10      5.745e+01  1.123e+00  51.157  < 2e-16 ***
## month11      4.938e+01  1.130e+00  43.710  < 2e-16 ***
## month12      4.654e+01  1.132e+00  41.113  < 2e-16 ***
## ratio       -2.757e+02  5.303e+00 -51.993  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.67 on 52371 degrees of freedom
## Multiple R-squared:  0.6676, Adjusted R-squared:  0.6675 
## F-statistic:  5259 on 20 and 52371 DF,  p-value: < 2.2e-16
checkresiduals(fit)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals
## LM test = 35101, df = 24, p-value < 2.2e-16

All variables are significant for the model, and the renewable/total production ratio also improved the R-squared. Looking at the residuals, they are evenly distributed around a zero mean. However, the ACF shows remaining autocorrelation, so lagged variables should be introduced to increase the accuracy of the model. We already know there is hourly, daily, and weekly seasonality; let’s add the corresponding lags to the predictive model:

df$target_lagged_1 <-shift(df$mcp_try, 1)
df$target_lagged_24 <-shift(df$mcp_try, 24)
df$target_lagged_168 <-shift(df$mcp_try, 168)
df$ratio <- (df$wind+df$lignite + df$geothermal + df$biomass)/df$total_prod
df[,trend:=1:.N]

df_train <- df[1:(53112-30*24),]
df_test <- df[(53112-30*24+1):53112,]

fit <- lm(mcp_try ~hour+load_plan+year+month+ratio+target_lagged_1+target_lagged_24+target_lagged_168 ,data=df_train)
summary(fit)
## 
## Call:
## lm(formula = mcp_try ~ hour + load_plan + year + month + ratio + 
##     target_lagged_1 + target_lagged_24 + target_lagged_168, data = df_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -431.96  -10.17   -0.26   11.02 1278.82 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.129e+00  1.448e+00   3.542 0.000398 ***
## hour              -4.815e-01  2.272e-02 -21.191  < 2e-16 ***
## load_plan          6.924e-04  3.993e-05  17.342  < 2e-16 ***
## year16             7.421e+00  6.009e-01  12.349  < 2e-16 ***
## year17             9.077e+00  6.503e-01  13.959  < 2e-16 ***
## year18             1.912e+01  8.104e-01  23.592  < 2e-16 ***
## year19             2.251e+01  8.818e-01  25.531  < 2e-16 ***
## year20             2.294e+01  8.973e-01  25.569  < 2e-16 ***
## year21             2.965e+01  1.199e+00  24.740  < 2e-16 ***
## month02           -1.885e+00  6.605e-01  -2.854 0.004318 ** 
## month03           -1.227e+00  6.478e-01  -1.894 0.058220 .  
## month04           -8.854e-01  6.683e-01  -1.325 0.185251    
## month05           -7.023e-01  6.723e-01  -1.045 0.296186    
## month06            3.168e+00  6.720e-01   4.714 2.43e-06 ***
## month07            1.939e-01  6.622e-01   0.293 0.769697    
## month08            1.814e+00  6.736e-01   2.692 0.007095 ** 
## month09            3.024e+00  6.811e-01   4.440 9.00e-06 ***
## month10            4.835e+00  6.854e-01   7.053 1.77e-12 ***
## month11            5.468e+00  6.813e-01   8.025 1.03e-15 ***
## month12            4.139e+00  6.817e-01   6.072 1.27e-09 ***
## ratio             -1.116e+02  3.212e+00 -34.744  < 2e-16 ***
## target_lagged_1    5.938e-01  2.971e-03 199.876  < 2e-16 ***
## target_lagged_24   1.793e-01  2.747e-03  65.268  < 2e-16 ***
## target_lagged_168  1.654e-01  2.716e-03  60.905  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.45 on 52200 degrees of freedom
##   (168 observations deleted due to missingness)
## Multiple R-squared:  0.8844, Adjusted R-squared:  0.8843 
## F-statistic: 1.736e+04 on 23 and 52200 DF,  p-value: < 2.2e-16
checkresiduals(fit)

## 
##  Breusch-Godfrey test for serial correlation of order up to 27
## 
## data:  Residuals
## LM test = 6334.2, df = 27, p-value < 2.2e-16

The lagged variables increased the R-squared by approximately 0.22, and all variables are significant except some of the months. Having achieved an adjusted R-squared of 0.88, let’s visualize the forecast and compare the model with the ARMA model.

pred <- predict(fit, df_test)

model_forecast <- pred

df$forecast <- 0
df$forecast[(53112-30*24+1):53112] <-model_forecast


p <- ggplot()+
  geom_line(data = df[52000:53112],aes(x = NewDate,y=mcp_try,color='Actual Data') ) +
  geom_line(data = df[(53112-30*24+1):53112],aes(x = NewDate,y=forecast,color='Prediction')) +
                     scale_color_manual(values = c(
    'Actual Data' = '#007cc3',
    'Prediction' = 'red')) + labs(title = 'Hourly Electricity Price(TL/MWh) Prediction', x= "Date", y = "Hourly Electricity Price(TL/MWh)",color = 'Legend') +
  theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
p

From the visualization, the model does a good job of predicting the 30 days of electricity prices. For a better assessment, the MAPE and WMAPE values should be checked:

#using the stored forecast column so each daily group compares its own rows
metrics <- df[(53112-30*24+1):53112] %>% group_by(year,day) %>% summarize(MAPE = mean(abs((mcp_try-forecast)/mcp_try)) * 100)
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
metrics
wmape <- abs(sum((df[(53112-30*24+1):53112]$mcp_try-df[(53112-24*30+1):53112]$forecast)*metrics$MAPE*100))/sum(df[(53112-24*30+1):53112]$mcp_try)
wmape
## [1] 6.02005

WMAPE is approximately 6%, slightly worse than the best ARMA model (WMAPE approx. 5.9%). High MAPE values are observed on the same days for both models.

6. Conclusion

In conclusion, the hourly electricity price data can be predicted fairly accurately with both ARMA models and regression models. However, I would expect the regression model to do better, since production data, renewable energy production, etc. were introduced into the model. For future work, a more detailed regression analysis should be done, and more variables could be leveraged through feature engineering. Adding variables one by one to the regression model would also be better for the analysis; I added all of the variables to the first model due to time restrictions.